Cython: Optimising for speed

This notebook aims to take the reader through a real-world example of increasing the speed of an algorithm. The given example is that of computing the normals for a given triangle mesh (points and a trilist). Computing the per-vertex normals of a mesh is an intensive operation that yields poor performance in pure Python. This is due to the need to loop over every triangle and, for every vertex, sum the normals of the triangles that the vertex is a member of. This notebook is not designed to describe Cython syntax or basics. It is assumed that the reader has some understanding of, and preferably experience with, writing Cythonised functions.

Algorithm Pseudocode

foreach face in faces:
    face_normal = crossproduct(vertices[face[1]] - vertices[face[0]],
                               vertices[face[2]] - vertices[face[0]])
    face_normal = normalise(face_normal)
    foreach v in face:
        vertices[v].in_faces.append(face_normal)

foreach vertex in vertices:
    normal = (0, 0, 0)
    foreach face_normal in vertex.in_faces:
        normal += face_normal
    vertex.normal = normalise(normal)

crossproduct(v0, v1):
    return (v0.y * v1.z - v0.z * v1.y,
            v0.z * v1.x - v0.x * v1.z,
            v0.x * v1.y - v0.y * v1.x)
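
As a quick sanity check of the cross product formula above, we can compare it against numpy's implementation for a single pair of vectors (a throwaway example, not part of the algorithm itself):


In [ ]:
import numpy as np

v0 = np.array([1.0, 2.0, 3.0])
v1 = np.array([4.0, 5.0, 6.0])

# The component-wise formula from the pseudocode...
manual = np.array([v0[1] * v1[2] - v0[2] * v1[1],
                   v0[2] * v1[0] - v0[0] * v1[2],
                   v0[0] * v1[1] - v0[1] * v1[0]])

# ...should agree with numpy's built-in cross product
print np.allclose(manual, np.cross(v0, v1))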

Begin

We begin by loading an appropriate mesh.


In [ ]:
import menpo.io as pio
import numpy as np

shapeimage = pio.import_image('/vol/atlas/databases/frgc/spring2003/04201d302.abs')
tris = shapeimage.mesh.trilist
points = shapeimage.mesh.points

print shapeimage

Pure Python Implementation

This implementation uses numpy and broadcasting to achieve its goals.


In [ ]:
def normalise(vec):
    # Avoid divisions by almost 0 numbers
    # np.spacing(1) is equivalent to Matlab's eps
    d = np.sqrt(np.sum(vec ** 2, axis=1))
    d[d < np.spacing(1)] = 1.0
    return vec / d[..., None]

def py_compute_normal(vertex, face):
    nface = face.shape[0]
    nvert = vertex.shape[0]
    
    # Calculate the cross product (per-face normal)
    normalf = np.cross(vertex[face[:, 1], :] - vertex[face[:, 0], :],
                       vertex[face[:, 2], :] - vertex[face[:, 0], :])
    normalf = normalise(normalf)
    
    # Calculate per-vertex normal
    normal = np.zeros([nvert, 3])
    for i in xrange(nface):
        f = face[i, :]
        for j in xrange(3):
            normal[f[j], :] += normalf[i, :]
            
    # Normalize
    normal = normalise(normal)
    
    # Enforce that the normals are outward
    v = vertex - np.mean(vertex, axis=0)
    s = np.sum(v * normal, axis=1)
    if np.sum(np.greater(s, 0)) < np.sum(np.less(s, 0)):
        # flip
        normal = -normal
        normalf = -normalf
    
    return normal, normalf

If we then time this function, we can see that it takes about 3 seconds (on my Intel(R) Xeon(R) CPU E5-1650 @ 3.20GHz with 32GB of RAM) for 123160 triangles and 61599 points.


In [ ]:
%timeit py_compute_normal(points, tris)

Naive Cython

This is obviously far too slow to be of any use. Therefore, we naively port the method to Cython. Cython is useful for code where tight looping is unavoidable, as is the case in computing the per-vertex normals. This is because it compiles as much of the code as possible down to C, which is very efficient at tight looping. To compile Cython code in the notebook, we have to load the Cython magic extension:


In [ ]:
%load_ext cythonmagic

The Cython extension gives us the %%cython cell magic, where we can put raw Cython code which will be compiled on execution. To get started with Cython, we note that the majority of Cython's speedup comes from the fact that we statically type variables. Therefore, we always have to import some C declarations via the cimport statement. For example, to use numpy, we could use:


In [ ]:
%%cython
import numpy as np

cimport cython
cimport numpy as np

Note that we have to Python-import numpy AND cimport it. Therefore, a simple Cython function using numpy would look like:


In [ ]:
%%cython
import numpy as np

cimport cython
cimport numpy as np

def my_pow(double x):
    return np.power(x, 2)

print my_pow(2.0)

It's important to note that there are 3 kinds of function definitions in Cython (contrasted in the short example below):

  • def
    • This is a Python function. It is called via Python and thus has all the overhead of a Python call. Any C code it contains still has to be entered from Python
    • Parameters are Python objects, which are then explicitly converted to static types if specified
    • Returns a Python object
  • cdef
    • This is a C signature and can ONLY be called from a Cython context. It cannot be called by pure Python code.
    • Parameters are converted to static types by the caller
    • Return type can be statically defined
  • cpdef
    • This is a mixed signature whereby Cython automatically builds a pure Python wrapper around a cdef function. So Python calls the wrapper and C calls the cdef function directly.
    • Parameters are converted to C types by the Python wrapper
    • Return types are statically defined and marshalled back to Python objects by the wrapper

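To make the difference concrete, here is a minimal sketch contrasting the three kinds (a throwaway example, not related to the normals algorithm):


In [ ]:
%%cython

def py_double(x):
    # Python function: x and the result are Python objects
    return 2 * x

cpdef double cp_double(double x):
    # C function plus an auto-generated Python wrapper
    return 2.0 * x

cdef double c_double(double x):
    # Pure C signature: only callable from Cython code in this module
    return 2.0 * x

# py_double and cp_double are callable from Python; c_double is not,
# but module-level Cython code like this can call it directly
print py_double(2), cp_double(2.0), c_double(2.0)
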
So, to create a naive implementation of our Python function in Cython, we define a cpdef function as follows:


In [ ]:
%%cython

import numpy as np
cimport numpy as np
cimport cython


cdef np.ndarray[np.float64_t, ndim=2] cy_normalise_naive(np.ndarray[np.float64_t, ndim=2] vec):
    # Avoid divisions by almost 0 numbers
    cdef np.ndarray[np.float64_t, ndim=1] d = np.sqrt(np.sum(vec ** 2, axis=1))
    d[d < np.spacing(1)] = 1.0
    return vec / d[..., None]
 

cpdef cy_compute_normal_naive(np.ndarray[np.float64_t, ndim=2] vertex, np.ndarray[int, ndim=2] face):
    cdef int nface = face.shape[0]
    cdef int nvert = vertex.shape[0]
    
    # Calculate the cross product (per-face normal)
    cdef np.ndarray[np.float64_t, ndim=2] normalf = np.cross(vertex[face[:, 1], :] - vertex[face[:, 0], :],
                                                             vertex[face[:, 2], :] - vertex[face[:, 0], :])
    normalf = cy_normalise_naive(normalf)
    
    # Calculate per-vertex normal
    cdef np.ndarray[np.float64_t, ndim=2] normal = np.zeros([nvert, 3])
    cdef np.ndarray[int, ndim=1] f
    for i in xrange(nface):
        f = face[i, :]
        for j in xrange(3):
            normal[f[j], :] += normalf[i, :]
    
    # Normalize
    normal = cy_normalise_naive(normal)
    
    # Enforce that the normals are outward
    cdef np.ndarray[np.float64_t, ndim=2] v = vertex - np.mean(vertex, axis=0)
    cdef np.ndarray[np.float64_t, ndim=1] s = np.sum(v * normal, axis=1)
    if np.sum(np.greater(s, 0)) < np.sum(np.less(s, 0)):
        # flip
        normal = -normal
        normalf = -normalf
    
    return normal, normalf

If we then time this function, we can see that it takes about 1.8 seconds (on my Intel(R) Xeon(R) CPU E5-1650 @ 3.20GHz with 32GB of RAM) for 123160 triangles and 61599 points. This represents an approximately 1.6x speedup just by naively moving the code into a Cython function. Other than the static typing, the code is almost identical.

Note: There are decorators such as @cython.boundscheck(False) and @cython.wraparound(False) that can provide speedups by telling Cython what kinds of accesses you guarantee arrays will have inside the function. See the Cython documentation on compiler directives for more information.
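
The following is a minimal sketch of how these decorators are applied (to a throwaway function, not the normals code); we have not used them in the implementations in this notebook:


In [ ]:
%%cython
import numpy as np

cimport cython
cimport numpy as np

@cython.boundscheck(False)  # we promise never to index out of bounds
@cython.wraparound(False)   # we promise never to use negative indices
cpdef double sum_of_squares(np.ndarray[np.float64_t, ndim=1] x):
    cdef double total = 0.0
    cdef int i
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total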


In [ ]:
%timeit cy_compute_normal_naive(points, tris)

Optimising Cython

However, we can do better than this! In order to give us a better indication of where the time is being spent, Cython provides the ability to pass flags to the %%cython magic. These can be compile-time flags or special annotation flags. The flag we are interested in is -a. This produces an annotated output that colour codes the typing that is going on within the Cython function. Yellow backgrounds indicate function calls back into Python (which is slow), and white/clear backgrounds represent pure C calls. If we run this on our naive implementation, we get the following:


In [ ]:
%%cython -a

import numpy as np
cimport numpy as np
cimport cython


cdef np.ndarray[np.float64_t, ndim=2] cy_normalise_naive(np.ndarray[np.float64_t, ndim=2] vec):
    # Avoid divisions by almost 0 numbers
    cdef np.ndarray[np.float64_t, ndim=1] d = np.sqrt(np.sum(vec ** 2, axis=1))
    d[d < np.spacing(1)] = 1.0
    return vec / d[..., None]
 

cpdef cy_compute_normal_naive(np.ndarray[np.float64_t, ndim=2] vertex, np.ndarray[int, ndim=2] face):
    cdef int nface = face.shape[0]
    cdef int nvert = vertex.shape[0]
    
    # unit normals to the faces
    cdef np.ndarray[np.float64_t, ndim=2] normalf = np.cross(vertex[face[:, 1], :] - vertex[face[:, 0], :],
                                                             vertex[face[:, 2], :] - vertex[face[:, 0], :])
    normalf = cy_normalise_naive(normalf)
    
    # unit normal to the vertex
    cdef np.ndarray[np.float64_t, ndim=2] normal = np.zeros([nvert, 3])
    cdef np.ndarray[int, ndim=1] f
    for i in xrange(nface):
        f = face[i, :]
        for j in xrange(3):
            normal[f[j], :] += normalf[i, :]
    
    # normalize
    normal = cy_normalise_naive(normal)
    
    # enforce that the normals are outward
    cdef np.ndarray[np.float64_t, ndim=2] v = vertex - np.mean(vertex, axis=0)
    cdef np.ndarray[np.float64_t, ndim=1] s = np.sum(v * normal, axis=1)
    if np.sum(np.greater(s, 0)) < np.sum(np.less(s, 0)):
        # flip
        normal = -normal
        normalf = -normalf
    
    return normal, normalf

Looking above, we see that the majority of the code is still making calls back into Python. In particular, the slow vertex loop is making Python calls on every iteration. Therefore, we want to try and remove these.


In [ ]:
%%cython -a

import numpy as np
cimport numpy as np
cimport cython


cdef np.ndarray[np.float64_t, ndim=2] cy_normalise_naive(np.ndarray[np.float64_t, ndim=2] vec):
    # Avoid divisions by almost 0 numbers
    cdef np.ndarray[np.float64_t, ndim=1] d = np.sqrt(np.sum(vec ** 2, axis=1))
    d[d < np.spacing(1)] = 1.0
    return vec / d[..., None]
 
    
cpdef cy_compute_normal_better(np.ndarray[np.float64_t, ndim=2] vertex, np.ndarray[int, ndim=2] face):
    cdef int nface = face.shape[0]
    cdef int nvert = vertex.shape[0]
    
    # unit normals to the faces
    cdef np.ndarray[np.float64_t, ndim=2] normalf = np.cross(vertex[face[:, 1], :] - vertex[face[:, 0], :],
                                                             vertex[face[:, 2], :] - vertex[face[:, 0], :])
    normalf = cy_normalise_naive(normalf)
    
    # unit normal to the vertex
    cdef np.ndarray[np.float64_t, ndim=2] normal = np.zeros([nvert, 3])
    cdef int f0, f1, f2
    for i in range(nface):
        f0 = face[i, 0]
        f1 = face[i, 1]
        f2 = face[i, 2]
        for j in range(3):
            normal[f0, j] += normalf[i, j]   
            normal[f1, j] += normalf[i, j]       
            normal[f2, j] += normalf[i, j]
    
    # normalize
    normal = cy_normalise_naive(normal)
    
    # enforce that the normals are outward
    cdef np.ndarray[np.float64_t, ndim=2] v = vertex - np.mean(vertex, axis=0)
    cdef np.ndarray[np.float64_t, ndim=1] s = np.sum(v * normal, axis=1)
    if np.sum(np.greater(s, 0)) < np.sum(np.less(s, 0)):
        # flip
        normal = -normal
        normalf = -normalf
    
    return normal, normalf

Eureka! By turning the vertex accumulation loop (lines 24-36 in the annotated output) into pure C, just by guaranteeing its accesses as C types, we have sped up our function to approximately 80 ms. This represents an approximate 38x speedup over the original! And all we did was partially unroll a single loop. This is the key when trying to optimise Cython code: you need to ensure that all loops make as few calls into Python code as possible.


In [ ]:
%timeit cy_compute_normal_better(points, tris)

Diminishing Returns

So now the game has become trying to turn as much of that yellow code into white code. Note that there is certainly a law of diminishing returns at play here. Our previous optimisation was almost certainly the largest jump in performance we will be able to achieve. Given that we don't have any other loops, we are unlikely to get further 100%+ jumps in performance. Numpy calls are already vectorised, and manually unrolling them into loops will not yield a very big performance boost. If we run the magic function %prun, this will give us profiling information about the functions that are called. We use the -r flag in order to return the profiler object so that we can print its stats into the cell. Normally, it would just be called as:

%prun cy_compute_normal_better(points, tris)

which opens up a separate panel in the notebook.


In [ ]:
p = %prun -r cy_compute_normal_better(points, tris)
p.print_stats()

The profiling output tells us that the majority of the time is spent inside the Cython function. However, almost half the time is spent inside numpy's cross product function. Looking at the source code of numpy's cross product shows us that it does a number of checks to ensure that the shapes of the input vectors match. However, we know that we are guaranteed standard Nx3 arrays. So, what happens if we roll our own cross product method?
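
Before we do, we can confirm that overhead for ourselves: np.cross is implemented in pure Python, so its source (checks and all) can be printed directly in the notebook:


In [ ]:
import inspect

# Print the pure-Python implementation of numpy's cross product
print inspect.getsource(np.cross)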


In [ ]:
%%cython -a

import numpy as np
cimport numpy as np
cimport cython

        
cdef np.ndarray[np.float64_t, ndim=2] normalise(np.ndarray[np.float64_t, ndim=2] vec):
    # Avoid divisions by almost 0 numbers
    cdef np.ndarray[np.float64_t, ndim=1] d = np.sqrt(np.sum(vec ** 2, axis=1))
    d[d < np.spacing(1)] = 1.0
    return vec / d[..., None]
     

cdef inline np.ndarray[np.float64_t, ndim=2] cross(double[:, :] x, double[:, :] y):
    cdef np.ndarray[np.float64_t, ndim=2] z = np.empty_like(x)
    cdef int n = x.shape[0]
    for i in range(n):
        z[i, 0] = x[i, 1] * y[i, 2] - x[i, 2] * y[i, 1]
        z[i, 1] = x[i, 2] * y[i, 0] - x[i, 0] * y[i, 2]
        z[i, 2] = x[i, 0] * y[i, 1] - x[i, 1] * y[i, 0]
    
    return z


cpdef cy_compute_normal(np.ndarray[np.float64_t, ndim=2] vertex, np.ndarray[int, ndim=2] face):
    cdef int nface = face.shape[0]
    cdef int nvert = vertex.shape[0]
    
    # unit normals to the faces
    cdef np.ndarray[np.float64_t, ndim=2] normalf = cross(vertex[face[:, 1], :] - vertex[face[:, 0], :],
                                                          vertex[face[:, 2], :] - vertex[face[:, 0], :])
    normalf = normalise(normalf)
    
    # unit normal to the vertex
    cdef np.ndarray[np.float64_t, ndim=2] normal = np.zeros([nvert, 3])
    cdef int f0, f1, f2
    for i in range(nface):
        f0 = face[i, 0]
        f1 = face[i, 1]
        f2 = face[i, 2]
        for j in range(3):
            normal[f0, j] += normalf[i, j]   
            normal[f1, j] += normalf[i, j]       
            normal[f2, j] += normalf[i, j]
    
    # normalize
    normal = normalise(normal)
    
    # enforce that the normals are outward
    cdef np.ndarray[np.float64_t, ndim=2] v = vertex - np.mean(vertex, axis=0)
    cdef np.ndarray[np.float64_t, ndim=1] s = np.sum(v * normal, axis=1)
    if np.sum(np.greater(s, 0)) < np.sum(np.less(s, 0)):
        # flip
        normal = -normal
        normalf = -normalf
    
    return normal, normalf

In [ ]:
%timeit cy_compute_normal(points, tris)
print ""
p = %prun -r cy_compute_normal(points, tris)
p.print_stats()

Basic Results

We've now reduced the execution time by over a third again. Looking at the profiler output, we see that practically all of the time is now spent inside the Cython function. Since all the operations are vectorised, we are unlikely to see anything but very incremental improvements from here. However, we've gone from nearly 3 s down to around 50 ms. Looking at the code, we've changed very little from the original Python version. Easily the most difficult part was rolling our own cross product, and even that was not a strictly necessary optimisation.

Now, just for our own peace of mind, we'll check that our optimised Cython version produces the same output as the original Python implementation.


In [ ]:
cy_normal, cy_normalf = cy_compute_normal(points, tris)
py_normal, py_normalf = py_compute_normal(points, tris)

print np.allclose(cy_normal, py_normal)
print np.allclose(cy_normalf, py_normalf)

Numba

By request I have implemented the algorithm as best I could in Numba. I should note that:

  • I have no idea what I'm doing using Numba, so this is unlikely to be optimised
  • I had enormous trouble even getting Numba to run on Ubuntu 13.04. I had to build and install my own LLVM-3.2.
  • It took me about 2 hours just to get something to compile, and I had to resort to print statements because the error output is so unhelpful

I've tried to comment why I did certain unrollings, though I can't fully justify them because I don't really understand them. I assume that specifying the expected type will help the optimisation, but I honestly have no idea. Presumably the np.greater and np.less problem is a bug in Numba?


In [ ]:
import numba
import numpy as np
from numba.decorators import autojit, jit
from math import sqrt


# Had to unroll this otherwise it complained about some strange python object
# coercion error
@autojit
def numba_normalise(vec):
    # Avoid divisions by almost 0 numbers
    # np.spacing(1) is equivalent to Matlab's eps
    n = vec.shape[0]
    for i in range(n):
        d = sqrt(vec[i, 0] * vec[i, 0] +
                 vec[i, 1] * vec[i, 1] +
                 vec[i, 2] * vec[i, 2])
        if d < np.spacing(1):
            d = 1.0

        vec[i, 0] /= d
        vec[i, 1] /= d
        vec[i, 2] /= d


# If I didn't roll my own cross product then computing
# the normals actually takes LONGER than the pure Python implementation
@jit(argtypes=(numba.double[:, :], numba.double[:, :]))
def cross_numba(x, y):
    output = np.empty_like(x)
    n = x.shape[0]
    for i in range(n):
        output[i, 0] = x[i, 1] * y[i, 2] - x[i, 2] * y[i, 1]
        output[i, 1] = x[i, 2] * y[i, 0] - x[i, 0] * y[i, 2]
        output[i, 2] = x[i, 0] * y[i, 1] - x[i, 1] * y[i, 0]
    return output


@jit(argtypes=(numba.double[:, :], numba.int32[:, :]))
def numba_compute_normal(vertex, face):
    nface = face.shape[0]

    # Calculate the cross product (per-face normal)
    normalf = cross_numba(vertex[face[:, 1], :] - vertex[face[:, 0], :],
                          vertex[face[:, 2], :] - vertex[face[:, 0], :])
    numba_normalise(normalf)

    # Calculate per-vertex normal
    normal = np.zeros_like(vertex)
    for i in range(nface):
        f = face[i, :]
        for j in range(3):
            normal[f[j], :] += normalf[i, :]

    # Normalize
    numba_normalise(normal)

    # Enforce that the normals are outward
    v = vertex - np.mean(vertex, axis=0)
    s = np.sum(v * normal, axis=1)
    s_gt_sum = 0
    s_lt_sum = 0

    # Had to expand this loop otherwise numba complained:
    # 'only length-1 arrays can be converted to Python scalars'
    # On the calls to np.greater and np.less
    for i in range(s.shape[0]):
        if s[i] > 0:
            s_gt_sum += 1
        elif s[i] < 0:
            s_lt_sum += 1

    if s_gt_sum < s_lt_sum:
        # flip
        normal = -normal
        normalf = -normalf

    return normal, normalf

As for the results, we've gone down to around 800 ms, which is definitely an improvement. Interestingly, this is on par with a Matlab implementation that I have (which is known to be JITted). As a sanity check, we will also verify that the output matches the pure Python implementation.


In [ ]:
%timeit numba_compute_normal(points, tris)

In [ ]:
numba_normal, numba_normalf = numba_compute_normal(points, tris)
py_normal, py_normalf = py_compute_normal(points, tris)

print np.allclose(numba_normal, py_normal)
print np.allclose(numba_normalf, py_normalf)
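
Summary

To summarise, the approximate timings reported above (same machine, 123160 triangles, 61599 points):

  • Pure Python (numpy): ~3 s
  • Naive Cython port: ~1.8 s
  • Cython with the unrolled vertex loop: ~80 ms
  • Cython with a hand-rolled cross product: ~50 ms
  • Numba: ~800 ms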